Summarizing and grouping Star Wars data

DSST 289: Introduction to Data Science

Erik Fredner

2024-09-11

Overview

  • Homework
  • 🧶Knitting
  • Exam overview
  • Summarizing data with summarize()
  • Grouping data with group_by()
  • Handling missing values
  • Visualizing summaries with ggplot2

Homework

library(tidyverse)
library(ggrepel)
library(knitr)

hans <- read_csv("../data/hans_roslin.csv")
hans |>
  select(country, continent, year, life_exp, gdp, pop) |> 
  # get a random sample of rows:
  slice_sample(n = 2) |>
  kable()
country continent year life_exp gdp pop
Paraguay Americas 1982 66.874 4258.504 3366439
Paraguay Americas 1962 64.361 2148.027 2009813

1. Compute the average life expectancy from the year 2007.

hans |>
  filter(year == 2007) |>
  summarize(mean_life_exp = mean(life_exp))
# A tibble: 1 × 1
  mean_life_exp
          <dbl>
1          67.0
hans |>
  filter(year == 2007) |>
  pull(life_exp) |>
  mean()
[1] 67.00742

2. Compute the average GDP from the year 2007.

hans |>
  filter(year == 2007) |>
  summarize(mean_gdp = mean(gdp))
# A tibble: 1 × 1
  mean_gdp
     <dbl>
1   11680.
hans |>
  filter(year == 2007) |>
  pull(gdp) |>
  mean()
[1] 11680.07

3. Compute the average life expectancy of each continent in the year 2002.

hans |>
  filter(year == 2002) |>
  group_by(continent) |>
  summarize(mean_life_exp = mean(life_exp))
# A tibble: 5 × 2
  continent mean_life_exp
  <chr>             <dbl>
1 Africa             53.3
2 Americas           72.4
3 Asia               69.2
4 Europe             76.7
5 Oceania            79.7

4. Compute the total number of people living in each continent in the year 1957.

hans |>
  filter(year == 1957) |>
  group_by(continent) |>
  summarize(total_pop = sum(pop))
# A tibble: 5 × 2
  continent  total_pop
  <chr>          <dbl>
1 Africa     264837738
2 Americas   386953916
3 Asia      1562780599
4 Europe     437890351
5 Oceania     11941976

5. Compute the total number of countries in each continent

Warning

This is wrong! (But the way it’s wrong is instructive.)

hans |>
  group_by(continent) |>
  summarize(n_countries = n())
# A tibble: 5 × 2
  continent n_countries
  <chr>           <int>
1 Africa            624
2 Americas          300
3 Asia              396
4 Europe            360
5 Oceania            24

Why is this wrong? Because we have many repeated observations of both continent and country. For example:

hans |> 
  group_by(continent) |> 
  filter(country == "Cuba") |> 
  select(continent, country, year) |> 
  slice_head(n = 5)
# A tibble: 5 × 3
# Groups:   continent [1]
  continent country  year
  <chr>     <chr>   <dbl>
1 Americas  Cuba     1952
2 Americas  Cuba     1957
3 Americas  Cuba     1962
4 Americas  Cuba     1967
5 Americas  Cuba     1972

5. Compute the total number of countries in each continent

Note

These are all correct.

hans |>
  group_by(continent) |>
  summarize(n_countries = n_distinct(country))
# A tibble: 5 × 2
  continent n_countries
  <chr>           <int>
1 Africa             52
2 Americas           25
3 Asia               33
4 Europe             30
5 Oceania             2
hans |> 
  distinct(continent, country) |> 
  count(continent)
# A tibble: 5 × 2
  continent     n
  <chr>     <int>
1 Africa       52
2 Americas     25
3 Asia         33
4 Europe       30
5 Oceania       2
hans |> 
  distinct(continent, country) |>
  group_by(continent) |> 
  summarize(n_countries = n())
# A tibble: 5 × 2
  continent n_countries
  <chr>           <int>
1 Africa             52
2 Americas           25
3 Asia               33
4 Europe             30
5 Oceania             2
hans |>
  # this is hacky but works
  # because there's one observation per country per year:
  filter(year == 2007) |>
  group_by(continent) |>
  summarize(n_countries = n())
# A tibble: 5 × 2
  continent n_countries
  <chr>           <int>
1 Africa             52
2 Americas           25
3 Asia               33
4 Europe             30
5 Oceania             2

🧶Knitting

  • .Rmd files can produce pretty .html documents.
  • These documents can be viewed in your web browser.
  • That makes them ideal for sharing.

How to knit

  1. Open your notebook from last class: notebook04.Rmd
  2. Click the Knit button in RStudio:

Knit button

Where knitted documents go

Knitted documents appear as .html files in the same directory as the .Rmd file:

Knit output

Viewing your knitted documents

Click on the .html file, and select “view in web browser:”

View knitted files in browser

Practice

  1. Open your .Rmd notebook from last class.
  2. Knit it.
  3. View the .html file in your browser.
  4. If you have trouble, let me know!

Why knit?

  • You will need to knit your exams.
  • .html files are easy to share.
    • They combine code, data, and outputs into a single document.

Exam overview

  • Two parts:
    • Take-home on a computer with open notes
    • In-class on paper without notes
  • Take-home exam resembles the notebooks we do in class.
    • Available on Monday, Sep. 16 after class
    • Due at the start of class on Wednesday, Sep. 18
    • Late take-home exams automatically drop your exam at least one letter grade

Exam topics

R Basics

  • Assigning variables <-
  • Using pipes |>
  • packages in the tidyverse: ggplot, dplyr, etc.
  • read_csv()
  • c()

Plotting

  • ggplot()
    • aes(): x, y, color, size
    • geom_: point, line, bar, text
      • alpha (transparency)
    • scale_: x, y, color, viridis
  • fct_inorder()
  • Fixed vs. non-fixed aesthetics

Data wrangling 1

  • select()
    • distinct()
  • filter() (==, !=, >, %in%, etc.)
    • filter(between())
  • arrange()
    • arrange(desc())

Data wrangling 2

  • group_by()
    • summarize()
    • mean(), min(), max(), n(), n_distinct()
  • mutate()
    • if_else()
    • case_when()
  • count()

Study tips

  1. Practice the material in the preceding slides.
  2. Review the notes.
  3. Review the slides.
  4. Review the notebook solutions.
  5. Redo blank notebooks.
  6. If you’ve done all that and you still want more practice, you can try asking the class’s Custom GPT for sample problems.

summarize()

summarize() reduces elements of a data frame with one of several summary functions. It is a dplyr verb.

Star Wars data

starwars <- read_csv(file.path("..", "data", "starwars.csv"))

starwars |>
  select(name, height, mass, sex) |>
  slice_head(n = 5) |>
  # make the tables look nice on the slides:
  kable()
name height mass sex
Luke Skywalker 172 77 male
C-3PO 167 75 none
R2-D2 96 32 none
Darth Vader 202 136 male
Leia Organa 150 49 female

summarize-ing height

starwars |>
  summarize(mean_height = mean(height, na.rm = TRUE))
# A tibble: 1 × 1
  mean_height
        <dbl>
1        175.

Why na.rm = TRUE?

The na.rm argument is used to remove NA values from the calculation. It would be inaccurate to assume that these should be 0 (or any other value). If na.rm = FALSE, the result will be not available:

starwars |> 
  summarize(mean_height = mean(height, na.rm = FALSE))
# A tibble: 1 × 1
  mean_height
        <dbl>
1          NA

How to find na values?

This is a common real world problem.

# Which star wars characters have missing heights?
starwars |>
  # is.na() returns TRUE if the value is NA:
  filter(is.na(height)) |>
  select(name, height) |>
  kable()
name height
Arvel Crynyd NA
Finn NA
Rey NA
Poe Dameron NA
BB8 NA
Captain Phasma NA

summarize-ing height and mass

starwars |>
  summarize(
    mean_height = mean(height, na.rm = TRUE),
    mean_mass = mean(mass, na.rm = TRUE)
  )
# A tibble: 1 × 2
  mean_height mean_mass
        <dbl>     <dbl>
1        175.      97.3

group_by

  • summarize-ing single columns is not very interesting.
  • summarize() gets interesting when we combine it with group_by().

height by sex

starwars |>
  group_by(sex) |>
  # summarize within groups:
  summarize(mean_height = mean(height, na.rm = TRUE)) |>
  arrange(mean_height) |>
  kable()
sex mean_height
none 131.2000
female 171.5714
hermaphroditic 175.0000
NA 175.0000
male 179.1228

group_by multiple variables

starwars |>
  group_by(sex, species) |>
  summarize(mean_height = mean(height, na.rm = TRUE)) |>
  arrange(mean_height) |>
  kable()
sex species mean_height
male Yoda’s species 66.0000
male Aleena 79.0000
male Ewok 88.0000
male Vulptereen 94.0000
male Dug 112.0000
male Xexto 122.0000
none Droid 131.2000
male Toydarian 137.0000
male Sullustan 160.0000
male Toong 163.0000
female Human 163.5714
female Clawdite 168.0000
female Mirialan 168.0000
male Rodian 173.0000
male Zabrak 173.0000
hermaphroditic Hutt 175.0000
NA NA 175.0000
female Togruta 178.0000
female Twi’lek 178.0000
male Mon Calamari 180.0000
male Twi’lek 180.0000
male Human 182.3913
male Geonosian 183.0000
female Tholothian 184.0000
male Iktotchi 188.0000
male Kel Dor 188.0000
male Trandoshan 190.0000
male Muun 191.0000
male Neimodian 191.0000
male Skakoan 193.0000
male Chagrian 196.0000
male Nautolan 196.0000
male Besalisk 198.0000
male Cerean 198.0000
male Pau’an 206.0000
male Gungan 208.6667
female Kaminoan 213.0000
male Kaleesh 216.0000
male Kaminoan 229.0000
male Wookiee 231.0000
male Quermian 264.0000

additional group operations

n() returns how many rows are in each group

starwars |>
  group_by(species) |>
  summarize(n = n()) |>
  arrange(desc(n)) |>
  kable()
species n
Human 35
Droid 6
NA 4
Gungan 3
Kaminoan 2
Mirialan 2
Twi’lek 2
Wookiee 2
Zabrak 2
Aleena 1
Besalisk 1
Cerean 1
Chagrian 1
Clawdite 1
Dug 1
Ewok 1
Geonosian 1
Hutt 1
Iktotchi 1
Kaleesh 1
Kel Dor 1
Mon Calamari 1
Muun 1
Nautolan 1
Neimodian 1
Pau’an 1
Quermian 1
Rodian 1
Skakoan 1
Sullustan 1
Tholothian 1
Togruta 1
Toong 1
Toydarian 1
Trandoshan 1
Vulptereen 1
Xexto 1
Yoda’s species 1

paste() collapses values in character colums to strings

starwars |>
  group_by(species) |>
  # you can use any string as a separator:
  summarize(names = paste(name, collapse = "️👽🛸🚀"))
# A tibble: 38 × 2
   species   names                                                      
   <chr>     <chr>                                                      
 1 Aleena    Ratts Tyerel                                               
 2 Besalisk  Dexter Jettster                                            
 3 Cerean    Ki-Adi-Mundi                                               
 4 Chagrian  Mas Amedda                                                 
 5 Clawdite  Zam Wesell                                                 
 6 Droid     C-3PO️👽🛸🚀R2-D2️👽🛸🚀R5-D4️👽🛸🚀IG-88️👽🛸🚀R4-P17️👽🛸🚀BB8
 7 Dug       Sebulba                                                    
 8 Ewok      Wicket Systri Warrick                                      
 9 Geonosian Poggle the Lesser                                          
10 Gungan    Jar Jar Binks️👽🛸🚀Roos Tarpals️👽🛸🚀Rugor Nass            
# ℹ 28 more rows

groupby() |> summarize() |> ggplot()

height and mass grouped by sex

starwars |>
  filter(!is.na(sex)) |> 
  group_by(sex) |>
  summarize(
    mean_height = mean(height, na.rm = TRUE),
    mean_mass = mean(mass, na.rm = TRUE),
    n = n()
  ) |>
  ggplot(aes(x = mean_height, y = mean_mass, size = n)) +
  geom_point() +
  geom_text_repel(aes(label = sex), size = 5)

Who are the outliers?

starwars |>
  filter(sex %in% c("hermaphroditic", "none")) |>
  select(name, height, mass, sex) |>
  kable()
name height mass sex
C-3PO 167 75 none
R2-D2 96 32 none
R5-D4 97 32 none
Jabba Desilijic Tiure 175 1358 hermaphroditic
IG-88 200 140 none
R4-P17 96 NA none
BB8 NA NA none

When summarizing groups

  • Be aware of missing data
  • Be aware of outliers
  • Be aware of how many items are in each group
    • The Jabba the Hutt problem
  • Make sure that groups are comparable

Summary

  • Use summarize() to compute summary statistics.
  • Combine summarize() with group_by() to perform group-wise summaries.
  • Beware missing data! na, None, etc.
    • One way to handle missing data for grouping operations: na.rm = TRUE
  • Visualize group summaries with ggplot2